power-law dynamic
Large language models transition from integrating across position-yoked, exponential windows to structure-yoked, power-law windows
Modern language models excel at integrating across long temporal scales needed to encode linguistic meaning and show non-trivial similarities to biological neural systems. Prior work suggests that human brain responses to language exhibit hierarchically organized "integration windows" that substantially constrain the overall influence of an input token (e.g., a word) on the neural response. However, little prior work has attempted to use integration windows to characterize computations in large language models (LLMs). We developed a simple word-swap procedure for estimating integration windows from black-box language models that does not depend on access to gradients or knowledge of the model architecture (e.g., attention weights). Using this method, we show that trained LLMs exhibit stereotyped integration windows that are well-fit by a convex combination of an exponential and a power-law function, with a partial transition from exponential to power-law dynamics across network layers. We then introduce a metric for quantifying the extent to which these integration windows vary with structural boundaries (e.g., sentence boundaries), and using this metric, we show that integration windows become increasingly yoked to structure at later network layers. None of these findings were observed in an untrained model, which as expected integrated uniformly across its input. These results suggest that LLMs learn to integrate information in natural language using a stereotyped pattern: integrating across position-yoked, exponential windows at early layers, followed by structure-yoked, power-law windows at later layers. The methods we describe in this paper provide a general-purpose toolkit for understanding temporal integration in language models, facilitating cross-disciplinary research at the intersection of biological and artificial intelligence.
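The window shape described in the abstract above, a convex combination of an exponential and a power-law function, can be sketched as follows. This is only an illustration: the abstract does not give the parameterization, so the function name `integration_window`, the mixing weight `w`, the time constant `tau`, the exponent `alpha`, and the shift by one inside the power law are all assumptions chosen so that both components equal 1 at distance zero.

```python
import numpy as np

def integration_window(distance, w, tau, alpha):
    """Convex mix of exponential and power-law decay (assumed parameterization)."""
    d = np.asarray(distance, dtype=float)
    exponential = np.exp(-d / tau)       # equals 1 at d = 0
    power_law = (1.0 + d) ** (-alpha)    # shifted so it also equals 1 at d = 0
    # w = 1 gives a purely exponential window, w = 0 a purely power-law one.
    return w * exponential + (1.0 - w) * power_law

distances = np.arange(0, 200)
# Hypothetical settings: a mostly exponential window and a mostly power-law one.
early = integration_window(distances, w=0.9, tau=5.0, alpha=1.0)
late = integration_window(distances, w=0.1, tau=5.0, alpha=1.0)
```

With these illustrative values, the mostly power-law window keeps noticeably more weight at long distances than the mostly exponential one, which is the qualitative difference between the two regimes.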
Power-law Dynamic arising from machine learning
Chen, Wei, Du, Weitao, Ma, Zhi-Ming, Meng, Qi
Deep neural networks (DNNs) have been trained successfully and have achieved major breakthroughs in AI tasks such as computer vision [7, 8, 14], speech recognition [21, 23, 24], and natural language processing [5, 26, 27]. Stochastic gradient descent (SGD) is a mainstream optimization algorithm in deep learning. Specifically, in each iteration, SGD randomly samples a mini-batch of data and updates the model using the stochastic gradient. For large DNN models, computing the gradient over every instance is costly. Thus, compared with gradient descent, which updates the model using the gradient over the full batch of data, SGD can train DNNs much more efficiently.
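The mini-batch update described above can be sketched on a toy problem. This is a minimal illustration, not the authors' setup: the least-squares objective, the function name `minibatch_sgd`, and all hyperparameter values are assumptions made for the example.

```python
import numpy as np

def minibatch_sgd(X, y, lr=0.1, batch_size=16, steps=500, seed=0):
    """Mini-batch SGD on 0.5 * mean squared error for a linear model."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        # Sample a random mini-batch instead of using all n instances.
        idx = rng.choice(n, size=batch_size, replace=False)
        residual = X[idx] @ w - y[idx]
        grad = X[idx].T @ residual / batch_size  # stochastic gradient
        w -= lr * grad
    return w

# Toy data (assumed): noiseless linear targets with known weights.
rng = np.random.default_rng(1)
X = rng.normal(size=(256, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true
w_hat = minibatch_sgd(X, y)
```

Each step touches only `batch_size` rows of `X`, whereas full-batch gradient descent would touch all of them in every iteration; that is the efficiency argument made in the paragraph above.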
Dynamic of Stochastic Gradient Descent with State-Dependent Noise
Meng, Qi, Gong, Shiqi, Chen, Wei, Ma, Zhi-Ming, Liu, Tie-Yan
Stochastic gradient descent (SGD) and its variants are mainstream methods for training deep neural networks. Since neural networks are non-convex, a growing body of work studies the dynamic behavior of SGD and its impact on generalization, especially the efficiency of escaping from local minima. However, these works make the over-simplified assumption that the covariance of the noise in SGD is (or can be upper bounded by) a constant, although it is actually state-dependent. In this work, we conduct a formal study of the dynamic behavior of SGD with state-dependent noise. Specifically, we show that the covariance of the SGD noise in the local region around a local minimum is a quadratic function of the state. We therefore propose a novel power-law dynamic with state-dependent diffusion to approximate the dynamic of SGD. We prove that the power-law dynamic can escape from sharp minima exponentially faster than from flat minima, whereas previous dynamics can escape from sharp minima only polynomially faster than from flat minima. Our experiments verify our theoretical results. Inspired by our theory, we propose adding state-dependent noise to (large-batch) SGD to further improve its generalization ability. Experiments verify that our method is effective.
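A one-dimensional sketch may help make the idea of state-dependent diffusion concrete. It simulates a stochastic differential equation with a quadratic noise covariance around a minimum at zero, using the Euler-Maruyama scheme. The abstract does not give the exact equations, so the form `sigma0 + sigma1 * x**2`, the linear drift, the names `noise_covariance` and `simulate`, and all parameter values are assumptions for illustration only.

```python
import numpy as np

def noise_covariance(x, sigma0=1.0, sigma1=1.0):
    """Assumed quadratic, state-dependent noise covariance near a minimum at 0."""
    return sigma0 + sigma1 * x ** 2

def simulate(x0=0.0, curvature=1.0, lr=0.1, dt=0.01, steps=1000, seed=0):
    """Euler-Maruyama simulation of dx = -curvature*x dt + sqrt(lr*C(x)) dW."""
    rng = np.random.default_rng(seed)
    xs = np.empty(steps + 1)
    xs[0] = x0
    for t in range(steps):
        x = xs[t]
        drift = -curvature * x                          # pull toward the minimum
        diffusion = np.sqrt(lr * noise_covariance(x))   # grows with distance from 0
        xs[t + 1] = x + drift * dt + diffusion * np.sqrt(dt) * rng.normal()
    return xs

path = simulate()
```

Under this assumed Itô form, setting the stationary probability flux to zero gives a density proportional to (sigma0 + sigma1 * x^2) raised to the power -(1 + curvature / (lr * sigma1)), that is, a power-law tail rather than the Gaussian tail obtained when the noise covariance is constant.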